Individual random effects model for differences in trait distribution among respondents

The homogeneity hypothesis is a common assumption in classic measurement. Accordingly, the item response theory (IRT) model assumes that different respondents with the same ability have the same option probabilities, which may not hold. The aim of this study is to propose a new individual random effect model that accounts for the differences in option probabilities among respondents with the same latent trait by using within-person variance. The model parameters are estimated by the MCMC method. The performance of the new model is evaluated through simulation studies and real data from the PRESUPP scale of PISA. The results show that, under different within-person variances, the individual random effect model provides more accurate parameter estimates and yields a scale parameter that describes the distribution of respondents' abilities. The new model has lower RMSE and better model fit than the classic IRT model.

of a correct response decreasing as a function of the distance between the respondent's and the item's position in the latent space 18. Although these models have been successfully applied in practice, they are not free of limitations. For example, the random item effect model and the finite mixture model require the background information of the respondents to be known before data analysis and need homogeneity within the subgroups. The bifactor model and the higher-order factor model introduce at least one additional factor to explain the response patterns of the respondents, and they do not work in the unidimensional condition. The response styles model requires more than two options for each item. The interaction model and the latent space model, which are the most flexible, respectively propose interaction variables and latent space distance to describe the interaction between the respondents and the items, but these two parameters are relative measurements. They express the relationship between a specific respondent and a specific item, and thus they are suitable for the interpretation of secondary factors. In practice, we sometimes cannot find and define secondary factors, and some scales do not involve secondary factors at all. The two-dimensional latent space model has insufficient explanatory power when there are too many items and respondents, while the multidimensional latent space model is not easy to calculate and understand.
Personality researchers were initially interested in differences in within-person variance and measured it by repeatedly administering the same items 19-21. Recently, based on their research, Williams 22 proposed the Bayesian nonlinear mixed effects location-scale model (NL-MELSM). This model allows within-person variance to follow a nonlinear trajectory in learning, which can determine whether variability is reduced during learning. Lin 23 proposed a multiple sharing parameter model, where the longitudinal outcomes of multiple densities are modeled by the mixed-effect location-scale model and further linked to the corresponding deletion mechanism through the shared respondent random effect. However, it cannot be ignored that previous exposure to a test will influence performance on the test. Even if retesting were done under identical conditions, the examinee is no longer the same person in the sense that relevant experiences have occurred that had not occurred before the first testing 24.
Williams et al. 25 proposed a perspective that went beyond homogeneous variance and viewed modeling within-person variance as an opportunity to gain a richer understanding of psychological processes. In this framework, a new model was introduced in which within-person variance is considered as a factor affecting respondents' option probabilities. The within-person variances are named individual random effects. They represent the magnitude of the within-person variance of the respondents' ability, which helps to obtain the distribution characteristics of the respondents' ability.
The development of the new model is inspired by the generalizability theory (GT) and the research on within-person variance in psychological measurement. According to the GT, all measurements have variances, which may arise from the measurement tools, from users of the tools not mastering the essentials, from the measurement conditions and environments, or from respondents who do not cooperate. In short, there are various sources of measurement variance, including within and between persons.
On the other hand, some researchers have argued that individual internal variation is not only inevitable, but also meaningful, from the perspectives of both psychological mechanisms and mathematical statistics 26-28. Measurement variance is not meaningless: it is a nonnegative parameter included in the complete measurement result, which reasonably characterizes the dispersion of the measured value. Omitting or double-counting variance sources leads to results that misrepresent the actual measurement status and affects the validity and reliability of measurement 29. Classic measurement tends to assume that the respondents are homogeneous within the group, and that the variance stems from the limitation of measurement tools. In the framework of the mixed-effect location-scale model, the goal of psychological measurement involves not only the measurement of location (or mean) but also the measurement of scale (or within-person variance). Within-person variance is considered not only to reflect measurement error but also to reflect system information 30.
Ferrando 31 proposed the same model based on Thurstone scaling, but eventually simplified it for the ease of parameter estimation, and used a two-stage parameter estimation method that may introduce some errors.
In this study, we introduced the individual random effect model (IREM) by combining the mixed-effect location-scale model and the IRT. The within-person variance in the model is incorporated into the IRT model as the respondents' parameter. The differences in option probabilities among respondents are regarded as the result of different within-person variances. This change in perspective comes with important benefits, such as: (a) we work with the original item response data rather than functions of item response data, (b) we estimate within-person variance as the respondents' parameter, rather than integrating it into item parameters, which facilitates a richer understanding of psychological processes, (c) we use the data of one measurement, not longitudinal data, for parameter estimation to avoid within-person variance being confounded with the change of the mean, (d) our approach is closely related to the IRT model, which facilitates interpretation.
To demonstrate the advantages of the model, we derive the IRT model and the variance decomposition principle, and then compare it with the classic IRT model through simulation and real data studies. In conclusion, the model in this study introduces a new scale parameter and has certain benefits in the estimated accuracy of θ. It is expected to offer a new perspective for studying the different option probabilities among respondents in the IRT model.

Model

Normal ogive model
In 1952, Lord proposed the first IRT model, the two-parameter normal ogive model, and applied it to the measurement of academic achievement and attitude 2. It included three basic assumptions: (a) unidimensionality, (b) local independence, and (c) a formal hypothesis about the item characteristic curve, namely, that it is monotonically increasing, which is given by the basic equation:

$$P(Y = 1 \mid \theta) = \int_{-\infty}^{a(\theta - b)} \frac{1}{\sqrt{2\pi}}\, e^{-t^{2}/2}\, dt \qquad (1)$$

where b is the difficulty parameter, a is the slope parameter for the item, and θ represents the respondent's ability; the probability is the area under the standard normal curve of the Z score from −∞ to a(θ − b).

Mathematical derivation of the individual random effect model
The individual random effects model can be derived based on the mathematical foundation of the normal ogive model. Suppose respondent i has ability θ and a binary item j has difficulty parameter b. Y is defined as the observed response value (score). There must be a threshold η such that Y = 1 whenever θ is greater than η, and Y = 0 otherwise. The distribution of η is normal, with the mean denoted by b and the variance denoted by $\sigma^{2}$; thus, the frequency distribution of η is given by

$$f(\eta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(\eta - b)^{2}}{2\sigma^{2}}\right)$$

Let $t = \frac{\eta - b}{\sigma}$, then $t \sim N(0, 1)$ and $\eta = t\sigma + b$. The probability that the respondent will score 1 on the item is

$$P(Y = 1 \mid \theta) = P(\eta < \theta) = \int_{-\infty}^{\theta} f(\eta)\, d\eta = \int_{-\infty}^{(\theta - b)/\sigma} \frac{1}{\sqrt{2\pi}}\, e^{-t^{2}/2}\, dt$$

Let $a = \frac{1}{\sigma}$, then it will be the same normal ogive model as (1), which considers θ to be a fixed value throughout the test while the threshold η goes up and down around b.
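As Williams et al. 25 proposed, any estimate of an individual's ability, θ, is an estimate of an average. It is assumed that the real ability, $\theta^{*}$, is normal, with the mean denoted by θ and the variance denoted by $\varepsilon^{2}$; thus, the frequency distribution of $\theta^{*}$ is given by

$$f(\theta^{*}) = \frac{1}{\sqrt{2\pi}\,\varepsilon} \exp\left(-\frac{(\theta^{*} - \theta)^{2}}{2\varepsilon^{2}}\right)$$

Then, the probability of respondent i endorsing item j is given by

$$P(Y = 1) = P(\theta^{*} > \eta) = P(\theta^{*} - \eta > 0)$$

Let $z = \theta^{*} - \eta$ and $t' = \frac{z - (\theta - b)}{\sqrt{\varepsilon^{2} + \sigma^{2}}}$. With $\theta^{*}$ and η independent, $z \sim N(\theta - b,\ \varepsilon^{2} + \sigma^{2})$, $t' \sim N(0, 1)$ and $z = t'\sqrt{\varepsilon^{2} + \sigma^{2}} + (\theta - b)$. The probability that the respondent will score 1 on the item is

$$P(Y = 1 \mid \theta) = P(z > 0) = \int_{-\infty}^{(\theta - b)/\sqrt{\varepsilon^{2} + \sigma^{2}}} \frac{1}{\sqrt{2\pi}}\, e^{-t'^{2}/2}\, dt' \qquad (7)$$

The integral of the normal ogive model cannot be expressed in terms of elementary functions, which makes it difficult to use in practice. This motivated the search for alternative models, and the logistic model was proposed in this context.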
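The derivation above can be checked numerically. The following is a minimal sketch, not the authors' code: it draws the momentary ability θ* and the threshold η independently (an assumption the variance sum ε² + σ² relies on) and compares the share of draws with θ* > η against the normal ogive form with slope 1/√(ε² + σ²). The function names and the parameter values are illustrative only.

```python
import math
import random


def normal_cdf(x):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def closed_form_probability(theta, b, eps, sigma):
    """Phi((theta - b) / sqrt(eps^2 + sigma^2)): the normal ogive form of the new model."""
    return normal_cdf((theta - b) / math.sqrt(eps ** 2 + sigma ** 2))


def simulated_probability(theta, b, eps, sigma, draws=200_000, seed=1):
    """Share of draws in which the momentary ability theta* exceeds the threshold eta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        theta_star = rng.gauss(theta, eps)  # theta* ~ N(theta, eps^2)
        eta = rng.gauss(b, sigma)           # eta ~ N(b, sigma^2), independent of theta*
        hits += theta_star > eta
    return hits / draws


# Illustrative values, not taken from the study.
exact = closed_form_probability(theta=0.5, b=-0.2, eps=0.8, sigma=1.2)
approx = simulated_probability(theta=0.5, b=-0.2, eps=0.8, sigma=1.2)
```

With these illustrative values the two quantities should agree up to Monte Carlo error, which is what the substitution z = θ* − η predicts.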
Haley 32 proved that for X ∈ R, the relationship between the logistic model and the normal ogive model can be stated as

$$\left| \Phi(X) - \frac{e^{1.702X}}{1 + e^{1.702X}} \right| < 0.01 \qquad (8)$$

where Φ is the standard normal distribution function. Therefore, the logistic model can be used as an approximation of the normal ogive model, which is much easier to calculate. Derived from (7) and (8), the new model assumes the following item response function:

$$P(Y_{ij} = 1 \mid \theta_{i}, b_{j}, \varepsilon_{i}, \sigma_{j}) = \frac{\exp\left(1.702\,(\theta_{i} - b_{j})/\sqrt{\varepsilon_{i}^{2} + \sigma_{j}^{2}}\right)}{1 + \exp\left(1.702\,(\theta_{i} - b_{j})/\sqrt{\varepsilon_{i}^{2} + \sigma_{j}^{2}}\right)}$$

It is worth noting that when $\varepsilon^{2}$ is constant, the individual random effects model is equivalent to the two-parameter IRT model. Thus, the individual random effects model can be regarded as a generalization of the two-parameter model. In practice, we determine whether $\varepsilon^{2}$ is constant via model selection, as described in Sect. "Simulation study". The value of the individual random effects model is that it explores the source of differences in the option probabilities of different respondents with the same ability on the same item under the unidimensional model, describing this by the difference of $\varepsilon^{2}$.

The individual random effects model described above is derived from the classic IRT model and can be regarded as a special two-parameter mixed model. Specifically, the respondent compares $\theta^{*}$ with η, and when $\theta^{*} > \eta$, the respondent obtains a score of "1" for the item. However, there is variance both from the item and the respondent, and θ, b, $\varepsilon^{2}$ and $\sigma^{2}$ jointly affect the relative position of $\theta^{*}$ and η, thereby affecting the respondent's response.
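To make the item response function concrete, here is a small sketch of the logistic form with slope 1/√(ε² + σ²), together with a grid check of the 1.702 approximation. It is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the grid covers only [−6, 6], and the parameter values are made up.

```python
import math

D = 1.702  # scaling constant linking the logistic curve to the normal ogive


def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def irem_probability(theta, b, eps, sigma):
    """Logistic item response function with slope 1 / sqrt(eps^2 + sigma^2)."""
    return logistic(D * (theta - b) / math.sqrt(eps ** 2 + sigma ** 2))


def two_pl_probability(theta, b, a):
    """Classic two-parameter logistic model with slope a."""
    return logistic(D * a * (theta - b))


# Largest gap between the normal ogive and the scaled logistic on a grid over [-6, 6].
max_gap = max(abs(normal_cdf(x / 100) - logistic(D * x / 100)) for x in range(-600, 601))

# Same ability and same item, but different within-person standard deviations.
p_stable = irem_probability(theta=1.0, b=0.0, eps=0.5, sigma=1.0)
p_variable = irem_probability(theta=1.0, b=0.0, eps=2.0, sigma=1.0)
```

When ε is the same for every respondent, the slope 1/√(ε² + σ²) depends only on the item, so `irem_probability` coincides with `two_pl_probability` for a suitable `a`. A larger ε flattens the curve, so the respondent with the larger within-person variance has a probability closer to 0.5.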
Compared with the classic IRT model, the individual random effects model does not require homogeneity in groups. It is worth noting that the respondent group is not always heterogeneous. How to identify heterogeneity, and whether using the individual random effects model in a homogeneous group leads to undesirable results, need to be explored in our research.
There are many similarities between the individual random effects model and the mixed model. Both allow the item parameters to differ across respondents. According to the mixed model, in different subgroups, the same item may have different slope and difficulty parameters, which is caused by the specific culture or background of the subgroups. The purpose of the mixed model is to distinguish the differences between subgroups and estimate the respondents' parameters more accurately. The individual random effects model focuses on the heterogeneity that may exist between any two respondents to estimate a new parameter and to calculate the different distribution of each respondent's ability, which is meaningful for predicting individual behavior.

Practical advantages
A unique advantage of the proposed individual random effects model is that it provides a parameter used to explain the differences in the option probabilities of respondents with the same ability. Due to different within-person variances, the respondents with the same ability show different option probabilities, which is in line with reality. At the same time, we can effectively obtain the specific distribution of the respondents' abilities by estimating the relevant variance of the respondents, which is important for improving measurement information and predicting individual behavior.

Theoretical advantages
One of the theoretical advantages of the proposed individual random effects model is that it weakens the conditional independence assumption of the classic IRT model and the homogeneity assumption of the classic measurement.

Conditional independence assumptions. The proposed individual random effects model is based on the following conditional independence assumption:

$$P(Y \mid \theta, b, \varepsilon, \sigma) = \prod_{i=1}^{N} \prod_{j=1}^{K} P(Y_{ij} \mid \theta_{i}, b_{j}, \varepsilon_{i}, \sigma_{j})$$

where Y is the full response matrix, $\theta = (\theta_{1}, \ldots, \theta_{N})$, and b, ε and σ are the corresponding parameter vectors. In words, the item responses are assumed to be independent conditional on the ability of the respondents, the difficulty of the items, and the variance caused by the respondents and the items. This conditional independence assumption is weaker than the conditional independence of the classic IRT model. The classic IRT model requires the (conditional) distribution of item scores to be independent of each other within any respondent group, and the item scores are only related to the ability θ 33.
The weaker conditional independence assumption of the individual random effects model allows for differences between respondents or items with the same θ or b.

Homogeneity assumptions.
The assumption of homogeneity is a prerequisite for classic measurements, including classic IRT. In measurement, homogeneity is manifested in that the respondents of different subgroups are indistinguishable: the subgroups have the same scale, similar knowledge structure and backgrounds, and there is no difference in the overall variance between the different subgroups. The formula can be expressed as $y_{ij} = \beta_{0} + u_{0i} + \epsilon_{ij}$, where $\beta_{0}$ is the fixed effect, $u_{0i}$ is the individual deviation, and $\epsilon_{ij}$ are the residuals. Under the condition of homogeneity, the residuals are assumed to be normally distributed with constant variance, i.e., $\epsilon_{ij} \sim N(0, \sigma_{\epsilon}^{2})$ with the same $\sigma_{\epsilon}^{2}$ for every respondent.
This is also the basis for the same option probabilities of respondents with the same ability. In measurement, homogeneity cannot be strictly guaranteed. Individual random effects models can estimate within-person variance. We are not treating homogeneous variance as a hypothesis that needs to be satisfied or treating within-person variance as noise. Rather, within-person variance is considered as a factor that affects the respondent's option probabilities and is included in the estimate.

Parameter estimate
When estimating the parameters of a complex probability density, the MCMC method is easier to apply than other methods. Therefore, we use the MCMC method to estimate the parameters of the individual random effects model and implement it in RStan.
Referring to previous studies 34 , we use the following priors:

Simulation study
To compare the individual random effects model and the classic IRT models, Monte Carlo simulations are used.
Compared with the classic model, the individual random effects model mainly incorporates the within-person variance of the respondents. The main factors that have substantial impacts on the accuracy of parameter estimation are the length of the test and the number of respondents. Therefore, this experiment contains 3 independent variables: (a) sample size (200, 500, 1000), (b) test length (20, 30, 50), and (c) the scale of within-person variance ($\sigma_{P}$ is constant or follows a log-normal distribution, corresponding to the classic two-parameter model and the new model, respectively). To reduce random errors, the simulation is repeated 30 times under each condition, and the results are averaged.

Generation of respondents' parameters
The number of respondents has three levels: N = 200, 500, 1000. The within-person variances of respondents have two levels: they either differ across respondents, with ln ε ∼ normal(0, 1), or are constant, with ε ≡ 1.

Generation of items' parameters
The number of items has three levels: K = 20, 30, and 50. The difficulty of the items is generated according to b ∼ normal(0, 1), and the slope of the 2PL model item, a = 1/σ, is generated according to ln σ ∼ normal(0, 1).
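The generation scheme described in this section can be sketched as follows. This is an illustration under stated assumptions, not the authors' simulation code: the text does not give the distribution of θ, so a standard normal is assumed here, and the function name `simulate_responses` and the returned dictionary layout are hypothetical.

```python
import math
import random

D = 1.702  # logistic scaling constant of the item response function


def logistic(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)


def simulate_responses(n_respondents, n_items, heterogeneous, seed=0):
    """Draw parameters and a binary response matrix for one simulation condition."""
    rng = random.Random(seed)
    # Assumption: abilities are standard normal (not stated in the text).
    theta = [rng.gauss(0.0, 1.0) for _ in range(n_respondents)]
    if heterogeneous:
        # ln(eps) ~ normal(0, 1): within-person variances differ across respondents.
        eps = [math.exp(rng.gauss(0.0, 1.0)) for _ in range(n_respondents)]
    else:
        # eps = 1 for everyone: the model reduces to the two-parameter model.
        eps = [1.0] * n_respondents
    b = [rng.gauss(0.0, 1.0) for _ in range(n_items)]                # b ~ normal(0, 1)
    sigma = [math.exp(rng.gauss(0.0, 1.0)) for _ in range(n_items)]  # ln(sigma) ~ normal(0, 1)
    responses = []
    for i in range(n_respondents):
        row = []
        for j in range(n_items):
            slope = 1.0 / math.sqrt(eps[i] ** 2 + sigma[j] ** 2)
            p = logistic(D * slope * (theta[i] - b[j]))
            row.append(1 if rng.random() < p else 0)
        responses.append(row)
    return {"theta": theta, "eps": eps, "b": b, "sigma": sigma, "responses": responses}


constant_case = simulate_responses(200, 20, heterogeneous=False, seed=1)
varying_case = simulate_responses(200, 20, heterogeneous=True, seed=1)
```

Each of the 3 × 3 × 2 conditions would call such a function 30 times with different seeds before fitting the competing models.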

Data analysis
To assess the model's estimation accuracy for the respondents' ability parameters, the following indicators are used: (1) The root mean square error:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{\theta}_{i} - \theta_{i})^{2}}$$

(2) The coefficient of deviation (bias):

$$\mathrm{Bias} = \frac{1}{N} \sum_{i=1}^{N} (\hat{\theta}_{i} - \theta_{i})$$

where $\hat{\theta}_{i}$ is the estimate of the mean ability of respondent i, $\theta_{i}$ is the real mean ability, and N is the number of respondents.
(3) The standard deviation of ε, $S_{\varepsilon}$: In practice, for a given dataset, it is natural to consider whether it follows a classic IRT model with a constant ε or an individual random effects model with a variable ε. If the data is generated by an individual random effects model, then $S_{\varepsilon}$ is greater than zero and the data analysis for the respondents and items should be based on the individual random effects model with the variable ε parameter; otherwise, the classic IRT model is sufficient. $S_{\varepsilon}$ is calculated as the standard deviation of the estimates $\hat{\varepsilon}_{i}$ across the N respondents.
(4) The correlation coefficient of $\hat{\varepsilon}$ and ε: ε is the parameter describing the size of the within-person variance. Therefore, it is worth exploring whether it can be estimated effectively. We use the correlation coefficient of $\hat{\varepsilon}$ and ε to describe the validity of the estimates.

The priors referred to in Sect. "Parameter estimate" are:

$$\mu_{bj} \sim \mathrm{cauchy}(0, 5), \quad j = 1, \ldots, K$$
$$\sigma_{bj} \sim \mathrm{cauchy}(0, 5), \quad \sigma_{bj} > 0, \quad j = 1, \ldots, K$$
$$\sigma_{aj} \sim \mathrm{cauchy}(0, 5), \quad \sigma_{aj} > 0, \quad j = 1, \ldots, K$$
$$\ln \sigma_{j} \sim \mathrm{normal}(0, 1), \quad j = 1, \ldots, K$$
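The four indicators can be computed as in the sketch below. It is only an illustration: the helper names are hypothetical, the toy numbers are made up, and using the sample standard deviation (divisor N − 1) for the spread of the estimated ε is an assumption, since the text does not state the divisor.

```python
import math
import statistics


def rmse(estimates, truths):
    """Root mean square error between estimated and true values."""
    return math.sqrt(statistics.fmean([(e - t) ** 2 for e, t in zip(estimates, truths)]))


def bias(estimates, truths):
    """Mean signed deviation of the estimates from the true values."""
    return statistics.fmean([e - t for e, t in zip(estimates, truths)])


def pearson(x, y):
    """Pearson correlation coefficient, written out to avoid version-specific helpers."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (c - my) for a, c in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((c - my) ** 2 for c in y))
    return cov / (sx * sy)


# Toy values for illustration only.
theta_true = [0.0, -0.5, 1.0]
theta_hat = [0.1, -0.4, 1.2]
eps_true = [0.5, 1.0, 2.0]
eps_hat = [0.8, 1.0, 1.5]

theta_rmse = rmse(theta_hat, theta_true)
theta_bias = bias(theta_hat, theta_true)
s_eps_hat = statistics.stdev(eps_hat)  # assumption: sample standard deviation (N - 1)
eps_corr = pearson(eps_hat, eps_true)
```

In the toy data the estimated ε values are pulled toward their mean, so `s_eps_hat` is smaller than the spread of `eps_true` even though the correlation is high.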

Results
Table 1 and Fig. 1 show the RMSE under different conditions.
The response data is generated both without and with differences in within-person variances (that is, by the 2PL model and by the IREM). The results in Table 1 show that for the data generated by 2PL, regardless of how the test length and sample size change, the RMSE of parameter estimation using the new model is considerably smaller than that of the 1PL and similar to that of the 2PL. The results in Table 2 show that when the data is generated by the new model, the RMSE of parameter estimation using the new model is considerably smaller than that of the 1PL and 2PL, and the test length has an important influence on the parameter recovery. As the length of the test increases, the RMSE of the new model decreases more substantially than that of the 1PL and 2PL, which shows that the IREM can provide more robust and accurate estimates (see Fig. 1). It is worth noting that, under all conditions, the RMSE is greater than 0.3. This is related to the large random variation of the simulation setting: when the item variance and the mean within-person variance of the respondents are both 1, the average slope is approximately 0.71, and the amount of item information is small. When the length of the test is limited, the standard error of the test is large 37. Under all conditions, the bias value is close to 0 (less than 0.05), indicating that regardless of whether the 2PL model or the IREM is used to generate the response matrix, the point estimates of all models are unbiased estimates of the respondent's ability.
For the data generated by different models, we need to perform model selection. The difference between the IREM and classic IRT models is whether ε changes among respondents. Figure 2 shows the standard deviation of ε under different conditions. When the data is generated by an individual random effects model, $\hat{S}_{\varepsilon}$, the estimated standard deviation of ε, is always smaller than the real standard deviation $S_{\varepsilon}$. When there is no individual random effect or the individual random effect is small, $\hat{S}_{\varepsilon}$ is approximately equal to 0, and when the individual random effect is large enough, the estimated standard deviation of ε is considerably greater than zero. These simulation results provide evidence that the proposed model selection method is helpful for determining whether the data conforms to the classic IRT model or the individual random effect model. In other words, the model selection method helps determine whether the classic IRT model is sufficient or whether there are differences in within-person variance among respondents. In addition, individual random effects models can identify and estimate these deviations.
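The slope value of about 0.71 quoted above follows from a short calculation. This is a sketch under a simplifying assumption: ε and σ are both fixed at 1, whereas the simulation draws them from log-normal distributions. The information bound is the standard maximum of the 2PL item information with the 1.702 scaling constant.

```latex
a = \frac{1}{\sqrt{\varepsilon^{2} + \sigma^{2}}} = \frac{1}{\sqrt{1 + 1}} \approx 0.71,
\qquad
I_{\max} = \frac{(1.702\,a)^{2}}{4} \approx \frac{1.45}{4} \approx 0.36
```

An item with slope 1 would reach about 0.72 on the same scale, so each simulated item carries roughly half as much information, which is consistent with the relatively large RMSE values.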

Data and estimate
As an example, we use the PRESUPP scale from the 2015 Programme for International Student Assessment (PISA). Ten items are included in the scale, which ask respondents how frequently their child engaged in science-related learning activities at home when he or she was 10 years old, and thereby inquire about parents' support for science learning in the middle childhood years from the following 10 aspects:

Table 1. RMSE values of potential trait levels of respondents under various conditions generated by 2PL.

The response categories were "very often", "regularly", "sometimes", "never" and had to be reverse-coded so that higher WLEs and higher difficulty correspond to higher levels of parental support. To adapt to the binary model, the responses "very often" and "regularly" are recorded as "1", which refers to higher frequency. The responses "sometimes" and "never" are recorded as "0", which refers to lower frequency.
This study uses the Croatian subset of the data, which contains N = 5220 participants' responses on these 10 items. The proportion of "1" responses per item ranges from 0.02 to 0.61. To implement MCMC, we specify the priors, iterations and burn-in period as described in Sect. "Parameter estimate". The computation took approximately 365 min for the individual random effects model and 48 min for the 2PL on a computer with an Intel(R) Core(TM) i7-11700K CPU and 64 GB RAM. Trace plots show reasonable convergence of the sampler. In addition, we used S. P. Brooks' 32 improved Gelman-Rubin convergence statistics to detect possible non-convergence. We ran the model with three sets of random initial values. The scale reduction factor is smaller than 1.01 for all model parameters, suggesting that there are no signs of non-convergence. We implemented the model selection method described in Sect. "Simulation study". According to the results of the simulation study, when $S_{\varepsilon}$ is approximately equal to zero, the data fits a classic IRT model; when $S_{\varepsilon}$ is considerably greater than zero, the data fits the individual random effects model. The obtained $S_{\varepsilon}$ is 0.21, which means that the within-person variances are different. The data is more consistent with the new model. Therefore, we move forward with the individual random effects model for the current application.
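For readers unfamiliar with the scale reduction factor, the sketch below computes the basic Gelman-Rubin statistic for one scalar parameter from several equal-length chains. It is a simplified illustration, not the Brooks-Gelman corrected version cited above and not the variant reported by Stan; the function name and the synthetic chains are made up.

```python
import math
import random
import statistics


def potential_scale_reduction(chains):
    """Basic Gelman-Rubin statistic for equal-length chains of one scalar parameter."""
    n = len(chains[0])
    chain_means = [statistics.fmean(chain) for chain in chains]
    within = statistics.fmean([statistics.variance(chain) for chain in chains])  # W
    between = n * statistics.variance(chain_means)                               # B
    pooled = (n - 1) / n * within + between / n  # pooled estimate of the posterior variance
    return math.sqrt(pooled / within)


rng = random.Random(42)
# Three synthetic chains that sample the same distribution.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(3)]
# Three synthetic chains where one is stuck around a different value.
stuck = [[rng.gauss(mu, 1.0) for _ in range(2000)] for mu in (0.0, 0.0, 3.0)]

rhat_mixed = potential_scale_reduction(mixed)
rhat_stuck = potential_scale_reduction(stuck)
```

Values close to 1 indicate that the between-chain variation is negligible relative to the within-chain variation, which is the sense in which a factor below 1.01 suggests no sign of non-convergence.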

Statement
This research involves the utilization of publicly available psychological measurement data from human participants. The dataset used in this study originates from the Programme for International Student Assessment (PISA), which adheres to ethical guidelines and privacy policies. Throughout the research process, we have strictly followed the guidance and ethical principles provided by the relevant committee to ensure the protection of participants' privacy and rights. In the original dataset, all personally identifiable information has been removed, and data is presented in an anonymized manner to safeguard the privacy of the participants. The analysis and interpretation of the data focus solely on overall trends and patterns without involving any content that could potentially identify individual participants.

Goodness-of-fit analyses
Model fit is often assessed with fit indices: the − 2 log-likelihood (− 2LL), Akaike's information criterion (AIC) and the deviance information criterion (DIC). However, AIC and DIC need the sample size to be much larger than the number of parameters 38, so they are not suitable for our IRT model. Stan uses the WAIC and the LOO for model comparison and selection because they are fully Bayesian and theoretically superior to classic information-based model selection indicators. In the context of IRT model selection, Luo Yong 39 studied the performance of the WAIC and the LOO on dichotomous IRT models and found that they were superior to classic methods. Therefore, this study compares the fit of the three models through three indices: − 2LL, WAIC and LOO.
Table 3 shows the fit indices of the three models. The results show that, compared with the 1PL model and the 2PL model, the individual random effects model performs better on all three indices: − 2LL, WAIC, and LOO.
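As a rough illustration of how the WAIC is obtained from MCMC output, the sketch below applies the usual definition (log pointwise predictive density minus the summed posterior variance of the pointwise log-likelihood, on the deviance scale). It is not the routine used in the study: the function names are hypothetical, and the input is assumed to be a draws-by-observations table of pointwise log-likelihood values.

```python
import math
import statistics


def log_mean_exp(values):
    """log(mean(exp(values))), computed stably by factoring out the maximum."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values) / len(values))


def waic(log_lik):
    """WAIC on the deviance scale; log_lik[s][n] is observation n under posterior draw s."""
    n_obs = len(log_lik[0])
    lppd = 0.0    # log pointwise predictive density
    p_waic = 0.0  # effective number of parameters
    for n in range(n_obs):
        column = [draw[n] for draw in log_lik]
        lppd += log_mean_exp(column)
        p_waic += statistics.variance(column)
    return -2.0 * (lppd - p_waic)


# Toy input: two posterior draws, one observation (values are made up).
toy = [[math.log(0.5)], [math.log(0.25)]]
toy_waic = waic(toy)
```

Lower values indicate better expected out-of-sample fit. The LOO estimate additionally requires Pareto-smoothed importance sampling, which is omitted here.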

Comparison with the classic IRT model
We compare the estimated parameter results with the classic IRT model. The classic IRT model also uses the MCMC method for estimation and has the same priors as the individual random effects model.
The estimated values of the respondent's ability of the three models are shown in Fig. 3. In general, the estimates of the three models are similar, and the correlation between the results of the individual random effects model and classic IRT models is 0.984 and 0.981. However, the estimates for some particular respondents are different, and the difference comes from the various parameter restrictions. The 2PL restricts the within-person variance to be constant, and the 1PL has the additional restriction that all the items have a constant slope. The new model releases both of these restrictions.
The new model not only estimates the position parameter information of the respondent's ability but also estimates its scale information. For example, respondents 371, 2754, and 3716 have the same estimated ability under the new model, 1.45, but the option probabilities of the three respondents are quite different. The response matrix is shown in Table 4. The estimated values of 1PL for the three respondents' abilities are 1.56, 1.83, and 1.28; the estimated values of 2PL are 1.55, 1.61, and 1.33. The classic IRT model attributes this difference to different abilities. The new model estimates different within-person variances based on the difference in option probabilities. If the parameter of within-person variances is introduced, the estimated abilities of the three respondents are shown in Fig. 4.
Figure 4 shows that the classic IRT model is not sensitive to the differences in the response pattern of the respondents, and the differences in response pattern are manifested as small differences in the ability. In the new

Summary
From the perspective of the mixed effect location scale model, we relax the restriction on the magnitude of the within-person variances in the IRT model and introduce the scale (within-person variances) parameter to construct a new IRT model. The new model is more flexible, more realistic, and has some theoretical significance and practical value. The advantage of the parameter estimation accuracy of the new model is demonstrated through a simulation study, and finally, the new model is compared with the classic IRT model using a real data study. The main research findings are: (1) A Monte Carlo simulation study showed that the individual random effects model can obtain an unbiased estimate of the respondent's ability, and its RMSE value is not larger than that of the classic IRT models. Moreover, when data is generated by the individual random effects model, the RMSE value of the individual random effects model is smaller than that of the classic IRT models, which suggests that the individual random effects model has better parameter estimation accuracy than the two-parameter model when there are differences in within-person variance. (2) When the data is generated by the individual random effect model, and the standard deviation of ε is large enough, the estimate of the standard deviation of ε can be used for model identification. As the individual random effect increases, the estimate of the standard deviation of ε also increases.

Limitation and possible applications
Although the 1PL and 2PL models are classic models and fit well in most cases, many studies have revealed that some respondents will deviate from the model, which implies that there are unexplainable differences among the respondents. We demonstrate the advantages of the individual random effects model through simulation studies.
In addition, we present evidence of individual differences in real data. At the same time, in some other datasets we tested (e.g. other countries' subsets of PRESUPP, the Neuroticism scales of the Eysenck questionnaire and a test of mathematics; the last two are reported in "Appendix A" of the supplement), we observe that differences in within-person variance also exist. However, it is worth noting that to ensure that the results of the research do not lose generality and to further advance related research in the future, the following aspects should be studied: (1) The new model can estimate the within-person variance of each respondent (the Pearson correlations between $\hat{\varepsilon}$ and ε in the simulation study ranged from 0.544 to 0.692 and increased as the number of items increased, providing evidence that the models are accurately implemented, as reported in "Appendix B" of the supplement). However, it requires too many items to get an accurate estimate of the within-person variances, which necessitates the introduction of polytomous response formats to the new model. (2) Estimating the within-person variance can help detect undesired forms of response behavior. For example, in psychometrics, there are inadequate responses and false responses, and an abnormal increase in the within-person variance may indicate that the respondent's own condition is unstable or that they do not respond seriously. However, the MCMC method used in this study underestimates the difference in the within-person variance and cannot accurately estimate its value; it also takes a long time, which is not conducive to practical application. It is necessary to further develop other parameter estimation methods for this purpose. (3) With the development of the IRT, researchers have proposed a large number of revised models, such as the four-parameter (4PL) model, which introduces lower asymptotic parameters (also known as guessing coefficients) and upper asymptotic parameters (also known as slipping coefficients) based on the classic 2PL model 40. The Response-Time IRT Models consider the respondents' response time and add the item response time parameters 41. The decision tree models (IRTree models) consider the respondent's preference response tendency for different positions 6. These models all study the differences among respondents, and future research can focus on the differences and connections between the individual random effects models and these models. (4) Further research showed that when the distribution of respondents' abilities was broader than the distribution of item difficulty, the advantage of the IREM for the accuracy of respondents' ability estimation was more evident (some results are reported in "Appendix C" of the supplement). This seems to imply that respondents who deviated from item difficulty were more likely to be misestimated in ability if within-person variances were neglected. This necessitates further mathematical derivation and empirical research.

https://doi.org/10.1038/s41598-024-62479-0 www.nature.com/scientificreports/

Figure 1. Comparison of RMSE of three models under different conditions.

Figure 2. Standard deviation of ε and its estimated value under various conditions.

Figure 3. Estimated values of the respondents' ability parameters under the three models.

Table 2. RMSE values of potential trait levels of respondents under various conditions generated by the new model.

Table 3. Relative fitting index of model.

Table 4. Response matrix of three respondents.
(3) The practical effects of the 1PL, 2PL, and IREM are compared using the 2015 PISA Parents' Support for Science Learning Questionnaire in Middle Childhood.The fit indices show improvement, and they are sensitive to the differences in the response pattern.